INTERSPEECH.2006 - Analysis and Assessment | Cool Papers

#1 Integrating Festival and Windows

Authors: Rhys James Jones ; Ambrose Choy ; Briony Williams

Festival is a popular open-source development and execution environment for speech synthesis. It has been well-integrated within many environments, particularly Unix ones, but so far has not been easy to integrate natively into Windows. We present two solutions to this: an MSAPI interface, which allows Festival voices to work with a range of speech-enabled Windows applications, and SpeechServer, a client-server architecture which allows Festival to operate within a Flash (or other) application within a web browser. While the motivation for this work was to enable new Welsh diphone Festival voices to be used within screenreaders and other Windows programs, the MSAPI interface is now modularised, allowing it to work with any Festival voice.

#2 Measuring the acceptable word error rate of machine-generated webcast transcripts

Authors: Cosmin Munteanu ; Gerald Penn ; Ron Baecker ; Elaine Toms ; David James

The increased availability of broadband connections has recently led to an increase in the use of Internet broadcasting (webcasting). Most webcasts are archived and accessed numerous times retrospectively. One of the hurdles users face when browsing and skimming through archives is the lack of text transcripts of the audio channel of the webcast archive. In this paper, we propose a procedure for prototyping an Automatic Speech Recognition (ASR) system that generates realistic transcripts of any desired Word Error Rate (WER), thus overcoming the drawbacks of both prototype-based and Wizard of Oz simulations. We used such a system in a study where human subjects performed question-answering tasks using archives of webcast lectures, and showed that their performance and perception of transcript quality are linearly affected by WER, and that transcripts with a WER of 25% or less would be acceptable for use in webcast archives.
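The WER figures quoted in this abstract can be made concrete with a small sketch. The Python function below computes WER in the usual way, as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name, the whitespace tokenisation, and the `ACCEPTABLE_WER` constant (set to the 25% threshold reported above) are illustrative assumptions and not the authors' tooling.

```python
# Hypothetical constant: the 25% acceptability threshold reported in the abstract.
ACCEPTABLE_WER = 0.25


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def is_acceptable(reference: str, hypothesis: str) -> bool:
    """Check a transcript against the assumed 25% threshold."""
    return word_error_rate(reference, hypothesis) <= ACCEPTABLE_WER
```

Because insertions are counted, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.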

#3 Analyzing reusability of speech corpus based on statistical multidimensional scaling method

Authors: Goshu Nagino ; Makoto Shozakai

In order to develop a target speech recognition system at lower cost in time and money, the reusability of existing speech corpora is becoming one of the most important issues. This paper proposes a new technique that applies a statistical multidimensional scaling method to analyze the reusability of a speech corpus. In an experiment using six speech corpora, which contain isolated words and short sentences used in car navigation systems, the effectiveness of the proposed method is evaluated using the conventional approach of cross-task recognition. Furthermore, the relationship among those speech corpora is clearly shown by the proposed method.

#4 Redundancy and productivity in the speech technology lexicon - can we do better?

Authors: Susan Fitt ; Korin Richmond

Current lexica for speech technology typically contain much redundancy, while omitting useful information. A comparison with lexica in other media and for other purposes is instructive, as it highlights some features we may borrow for text-to-speech and speech recognition lexica.

#5 Word intelligibility estimation of noise-reduced speech

Authors: Takeshi Yamada ; Masakazu Kumakura ; Nobuhiko Kitawaki

It is indispensable to establish an objective test methodology for noise-reduced speech. This paper proposes a new methodology which estimates word intelligibility of the noise-reduced speech from PESQ MOS (subjective MOS estimated by the PESQ). To evaluate the effectiveness of the proposed methodology, a word intelligibility test of the noise-reduced speech was performed by using four noise reduction algorithms and word lists which take word difficulty into account, and then the word intelligibility was estimated by the proposed methodology. The results confirmed that the word intelligibility can be estimated well from the PESQ MOS without distinguishing the noise reduction algorithms and the noise types.

#6 Exploring the unknown - collecting 1000 speakers over the internet for the Ph@ttSessionz database of adolescent speakers

Author: Christoph Draxler

The Ph@ttSessionz project will create a database of 1000 adolescent German speakers. The project employs a novel approach to collecting speech data: recordings are being performed via the WWW in more than 35 schools in Germany, and the data is immediately transferred to the BAS server in Munich. Using this approach, geographically distributed recordings in high bandwidth quality can be performed efficiently and reliably. The paper presents the infrastructure developed at BAS for WWW-based speech recordings, it discusses the strategies employed to get schools to participate in the project, and it presents preliminary analyses of the speech database.

#7 A new single-ended measure for assessment of speech quality

Authors: Timothy Murphy ; Dorel Picovici ; Abdulhussain E. Mahdi

This paper proposes a new non-intrusive measure for objective speech quality assessment in telephony applications and evaluates its performance. The measure is based on estimating perception-based objective auditory distances between voiced parts of the degraded speech under test and an appropriately formulated artificial reference model of clean speech signals. The reference model is extracted from one or many pre-formulated speech reference books. The reference books are formed by optimally clustering a large number of parametric speech vectors extracted from a database of clean speech signals, using an efficient k-dimensional tree structure. The measured auditory distances are then mapped into objective listening quality scores. Reported evaluation results show that the proposed measure offers a sufficiently accurate, low-complexity method for assessing speech quality, making it suitable for real-time applications.

#8 Speech technology for minority languages: the case of Irish (Gaelic)

Authors: Ailbhe Ní Chasaide ; John Wogan ; Brian Ó Raghallaigh ; Áine Ní Bhriain ; Eric Zoerner ; Harald Berthelsen ; Christer Gobl

The development of speech technology could play an important role in the maintenance and preservation of minority languages, especially where the population of native speakers is dwindling. This paper outlines the efforts within the WISPR project to develop annotated spoken corpora along with some of the prerequisites for the synthesis of Irish (Gaelic). It details the particular challenges that have confronted us as well as the strategies adopted to overcome them. It highlights the need to gear our methodologies to these constraints and to maximise the reusability of resources. Our long-term goal is not only to develop these resources for Irish, but also, in parallel, to develop methodologies that will enable the technology to be flexible and suitable for the envisaged end users, e.g., more flexible kinds of synthesisers, with expressive capabilities and multiple voices, including children's. It is therefore a major consideration to develop resources in such a way that they are in some sense independent of any single methodology (unit selection vs. other modalities for synthesis development).

#9 Further investigations on the relationship between objective measures of speech quality and speech recognition rates in noisy environments

Authors: Francisco José Fraga ; Carlos Alberto Ynoguti ; André Godoi Chiovato

The relationship between an objective measure of speech quality (PESQ) and the recognition rate of a given speech recognition system has already been investigated by other researchers. In this paper, we present a further investigation of this relationship. In our research, the speech recognition tests were performed on a wider range of signals and SNRs. The experimental setup and the speech recognition systems evaluated here were based on the directions given by the Aurora project. Moreover, a new parametric modeling approach for the PESQ-MOS versus speech recognition rate curve, based on the logistic function, is proposed. This new modeling allows some meaningful interpretations of the parameters of the logistic function in terms of system robustness, and permits inferences in regions outside the experimental measurements. Furthermore, the PESQ versus SNR characteristic was used to group types of noise, leading to a much better fit of the logistic function to the data points.
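To illustrate the kind of curve described here, the sketch below evaluates a generic three-parameter logistic function that maps a PESQ-MOS value to a recognition rate. The abstract does not give the authors' exact parameterisation, so the form used, the parameter names (`max_rate`, `slope`, `midpoint`), and the numeric values in the usage note are assumptions chosen purely for illustration.

```python
import math


def logistic_recognition_rate(pesq_mos: float, max_rate: float,
                              slope: float, midpoint: float) -> float:
    """Generic logistic curve: max_rate / (1 + exp(-slope * (x - midpoint))).

    max_rate  -- assumed upper asymptote of the recognition rate (e.g. in %)
    slope     -- assumed steepness of the transition
    midpoint  -- assumed PESQ-MOS value at which the rate is max_rate / 2
    """
    return max_rate / (1.0 + math.exp(-slope * (pesq_mos - midpoint)))
```

With hypothetical parameters `max_rate=100`, `slope=2`, and `midpoint=2.5`, the function returns half of the asymptote at a PESQ-MOS of 2.5 and rises monotonically towards 100 as PESQ-MOS increases. Parameters of this kind are what a robustness interpretation could attach to: for example, a lower midpoint would mean the recogniser reaches half of its ceiling at a lower speech quality.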

#10 Non-intrusive speech quality assessment with low computational complexity

Authors: Volodya Grancharov ; David Y. Zhao ; Jonas Lindblom ; W. Bastiaan Kleijn

We describe an algorithm with very low computational and memory requirements for monitoring subjective speech quality without access to the original signal. The features used in the proposed algorithm can be computed from commonly used speech-coding parameters. Reconstruction and perceptual transformation of the signal are not performed. The algorithm generates quality assessment ratings without explicit distortion modeling. The simulation results indicate that the proposed non-intrusive objective quality measure performs better than the ITU-T P.563 standard despite its very low computational complexity.

#11 Using speech recognition technique for constructing a phonetically transcribed Taiwanese (Min-Nan) text corpus

Authors: Min-Siong Liang ; Ren-Yuan Lyu ; Yuang-Chin Chiang

Collecting a Taiwanese text corpus with phonetic transcription suffers from the problem of multiple pronunciation variants. By augmenting the text with speech, and using automatic speech recognition with a sausage searching net constructed from the multiple pronunciations of the text corresponding to its speech utterance, we are able to reduce the effort for phonetic transcription. By using the multiple pronunciation lexicon, a transcription error rate of 13.94% was achieved. Further improvement can be achieved by adapting the pronunciation lexicon with pronunciation variation (PV) rules derived from a manually corrected speech corpus. The PV rules can be categorized into two kinds: knowledge-based and data-driven rules. By incorporating the PV rules, an error rate reduction of 13.63% could be achieved. Although the technique was developed for Taiwanese speech, it could easily be adapted to other similar "minority" Chinese spoken languages.

#12 SloParl - Slovenian parliamentary speech and text corpus for large vocabulary continuous speech recognition

Authors: Andrej Zgank ; Tomas Rotovnik ; Matej Grasic ; Marko Kos ; Damjan Vlaj ; Zdravko Kacic

This paper presents a novel Slovenian language resource, the SloParl database. It consists of debates recorded in the Slovenian Parliament. The main goal of the project was to cost-effectively collect a new Slovenian language resource that could be used to augment the available Slovenian speech corpora for developing a large vocabulary continuous speech recognition system. The SloParl speech corpus has a total length of 100 hours. It includes selected sessions from the years 2000-2005. This speech corpus will be used for lightly supervised or unsupervised acoustic model training, and the accompanying transcriptions were prepared accordingly. The second part of the SloParl database is the text corpus, which covers the text of all debates from the period 1996-2005. It consists of 23M words. It will be used to create different types of language models for speech recognisers. Comparison with other Slovenian language resources showed that the SloParl database adds new aspects to the modelling of the Slovenian language.

#13 An annotation scheme for agreement analysis

Authors: Siew Leng Toh ; Fan Yang ; Peter A. Heeman

To accomplish a task that requires collaboration, people first agree on a strategy and then carry it out together [1]. Our research interest lies in understanding how people explore different strategies and reach an agreement in conversation. We began by examining two-person dialogues in a very limited domain, in which we could focus solely on the agreement process. In this paper, we describe an annotation scheme for coding the conversants' behaviors of exploring possible strategies, suggesting and accepting the optimal one, and then maintaining it. We report the inter-coder reliability of the annotation scheme for three expert annotators and two non-experts.

#14 Conversational quality estimation model for wideband IP-telephony services

Authors: Hitoshi Aoki ; Atsuko Kurashima ; Akira Takahashi

As broadband and high-speed IP networks spread, IP-telephony services have become a popular speech communication application over IP networks. Recently, the speech quality of IP-telephony services has become close to that of conventional PSTN services. To provide better speech quality to users, speech communication with wider bandwidth (e.g., 7 kHz) is one of the most promising applications. To ensure desirable quality, we should design the quality before services start and manage it while they are being provided. To do this, an effective means for estimating users' perceptions of speech quality is indispensable. This paper describes a model for estimating the conversational quality of wideband IP-telephony services from physical characteristics of terminals and networks. The proposed model takes into account the quality enhancement effect achieved by widening speech bandwidth and has the advantage that it can evaluate the quality of both wideband and telephone-band speech on the same scale. Based on subjective conversational quality evaluation experiments, we show that the proposed model can accurately estimate the subjective quality for wideband speech as well as for telephone-band speech.

#15 The vocal joystick data collection effort and vowel corpus

Authors: Kelley Kilanski ; Jonathan Malkin ; Xiao Li ; Richard Wright ; Jeff A. Bilmes

Vocal Joystick is a mechanism that enables individuals with motor impairments to make use of vocal parameters to control objects on a computer screen (buttons, sliders, etc.) and ultimately will be used to control electro-mechanical instruments (e.g., robotic arms, wireless home automation devices). In an effort to train the VJ-system, speech data from the TIMIT speech corpus was initially used. However, due to problematic issues with co-articulation, we began a large data collection effort in a controlled environment that would not only address the problematic issues, but also yield a new vowel corpus that was representative of the utterances a user of the VJ-system would use. The data collection process evolved over the course of the effort as new parameters were added and as factors relating to the quality of the collected data in terms of the specified parameters were considered. The result of the data collection effort is a vowel corpus of approximately 11 hours of recorded data comprised of approximately 23500 sound files of the monophthongs and vowel combinations (e.g. diphthongs) chosen for the Vocal Joystick project varying along the parameters of duration, intensity and amplitude. This paper discusses how the data collection has evolved since its initiation and provides a brief summary of the resulting corpus.

#16 Comparison of the ITU-T P.85 standard to other methods for the evaluation of text-to-speech systems

Authors: Dmitry Sityaev ; Katherine Knill ; Tina Burrows

Evaluation of TTS systems is essential to assess performance. The ITU-T P.85 standard was introduced in 1994 to assess the overall quality of speech synthesis systems. However, it has not been widely accepted or used. This paper compares the ITU test to more commonly used tests for intelligibility (semantically unpredictable sentences (SUS)) and naturalness (mean opinion score based). The aim of this research was to determine if the ITU test can provide a better performance measure and/or supplementary information to help evaluate TTS systems.

#17 An annotation scheme for complex disfluencies

Authors: Peter A. Heeman ; Andy McMillin ; J. Scott Yaruss

In this paper, we present an annotation scheme for disfluencies. Unlike previous schemes, this scheme allows the annotation of complex disfluencies with multiple backtracking points, which are common in stuttered speech. The scheme specifies each disfluency in terms of word-level annotations, thus making the scheme useful for building sophisticated language models of disfluencies. As determining the annotation codes is quite difficult, we have developed a pen-and-paper procedure in which the annotator lines up the words into rows and columns, from which it is straightforward for the annotator to determine the annotation tags.

#18 Automatic phonetic transcription of large speech corpora: a comparative study

Authors: Christophe Van Bael ; Lou Boves ; Henk van den Heuvel ; Helmer Strik

This study investigates whether automatic transcription procedures can approximate manual phonetic transcriptions typically delivered with contemporary large speech corpora. We used ten automatic procedures to generate a broad phonetic transcription of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues). The resulting transcriptions were compared to manually verified phonetic transcriptions. We found that the quality of this type of transcription can be approximated by a fairly simple and cost-effective procedure.

#19 Examining knowledge sources for human error correction

Authors: Yongmei Shi ; Lina Zhou

A variety of knowledge sources have been employed by error correction mechanisms to improve the usability of speech recognition (SR) technology. However, little is known about the effect of knowledge sources on human error correction. Advancing our understanding of the role of knowledge sources in human error correction could improve the state of automatic error correction. We selected three knowledge sources (alternative list, imperfect context, and perfect context) and compared their usefulness to human error correction via an empirical user study. The results showed that knowledge sources had a significant impact on the performance of human error correction. In particular, perfect context performed best, significantly reducing word error rate without increasing the processing time.

#20 Unsupervised language model adaptation based on automatic text collection from WWW

Authors: Motoyuki Suzuki ; Yasutomo Kajiura ; Akinori Ito ; Shozo Makino

An n-gram trained on a general corpus gives high performance. However, it is well known that a topic-specialized n-gram gives higher performance than the general n-gram. In order to make a topic-specialized n-gram, several adaptation methods have been proposed. These methods use a given corpus corresponding to the target topic, or collect documents related to the topic from a database. If there is neither a given corpus nor topic-related documents in the database, the general n-gram cannot be adapted to a topic-specialized n-gram. In this paper, a new unsupervised adaptation method is proposed. The method collects topic-related documents from the World Wide Web. Several query terms are extracted from recognized text, and the web pages returned by a search engine are used for adaptation. Experimental results showed that the proposed method gave 7.2 points higher word accuracy than the general n-gram.

#21 Unsupervised language model adaptation using latent semantic marginals

Authors: Yik-Cheung Tam ; Tanja Schultz

We integrated the Latent Dirichlet Allocation (LDA) approach, a latent semantic analysis model, into an unsupervised language model adaptation framework. We adapted a background language model by minimizing the Kullback-Leibler divergence between the adapted model and the background model subject to a constraint that the marginalized unigram probability distribution of the adapted model is equal to the corresponding distribution estimated by the LDA model - the latent semantic marginals. We evaluated our approach on the RT04 Mandarin Broadcast News test set and experimented with different LM training settings. Results showed that our approach reduces the perplexity and the character error rates using supervised and unsupervised adaptation.
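A rough sketch of what marginal-constrained adaptation can look like in practice follows. The abstract does not state the closed form used, so the code assumes the unigram-rescaling approximation commonly used for this kind of constraint: each background probability P_bg(w|h) is multiplied by (P_target(w) / P_bg(w))^beta and the result is renormalised over the vocabulary. The function name, the dictionary-based representation of the distributions, and the exponent `beta` are hypothetical choices for illustration only.

```python
def adapt_with_marginals(background_cond: dict, background_unigram: dict,
                         target_unigram: dict, beta: float = 0.5) -> dict:
    """Rescale P_bg(w | h) for one fixed history h towards target marginals.

    background_cond    -- word -> P_bg(w | h) for a single history h
    background_unigram -- word -> P_bg(w), the background unigram marginal
    target_unigram     -- word -> target marginal (e.g. estimated by a topic model)
    beta               -- assumed tuning exponent controlling adaptation strength
    """
    # Scale each word by the ratio of target to background marginal.
    scaled = {
        word: prob * (target_unigram[word] / background_unigram[word]) ** beta
        for word, prob in background_cond.items()
    }
    # Renormalise so the adapted probabilities sum to one for this history.
    normaliser = sum(scaled.values())
    return {word: value / normaliser for word, value in scaled.items()}
```

For a toy two-word vocabulary where the background model is uniform and the target marginal favours one word, the adapted distribution shifts probability towards that word while still summing to one. Setting `beta` to 0 leaves the background model unchanged.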

#22 Unsupervised language model adaptation for Mandarin broadcast conversation transcription

Authors: David Mrva ; Philip C. Woodland

This paper investigates unsupervised language model adaptation on a new task of Mandarin broadcast conversation transcription. It was found that N-gram adaptation yields a 1.1% absolute character error rate gain and that continuous space language model adaptation done with PLSA and LDA brings a 1.3% absolute gain. Moreover, a broadcast news language model trained on a large amount of data underperforms, when used alone, a model that additionally includes a small amount of broadcast conversation data by 1.8% absolute character error rate. Although broadcast news and broadcast conversation tasks are related, this result shows their large mismatch. In addition, it was found that reliable detection of broadcast news and broadcast conversation data is possible with the N-gram adaptation.

#23 Language model adaptation for tiny adaptation corpora

Author: Dietrich Klakow

In this paper we address the issue of building language models for very small training sets by adapting existing corpora. In particular, we investigate methods that combine task-specific unigrams with longer-range models trained on a background corpus. We propose a new method to adapt class models and show how fast marginal adaptation (FMA) can be improved. Instead of estimating the adaptation unigram only on the adaptation corpus, we study specific methods to adapt unigram models as well. In extensive experimental studies we show the effectiveness of the proposed methods. Compared to FMA as described in [1], we obtain an improvement of nearly 60% for ten utterances of adaptation data.

#24 Pronunciation dependent language models

Author: Andrej Ljolje

Speech recognition systems are conventionally broken up into phonemic acoustic models, pronouncing dictionaries in terms of the phonemic units in the acoustic model, and language models in terms of lexical units from the pronouncing dictionary. Here we explore a new method for incorporating pronunciation probabilities into recognition systems by moving them from the pronouncing lexicon into the language model. The advantages are that pronunciation dependencies across word boundaries can be modeled, including contextual dependencies like geminates or consistency in pronunciation style throughout the utterance. The disadvantage is that the number of lexical items grows proportionally to the number of pronunciation alternatives per word and that language models which could be trained using text now need phonetically transcribed speech or equivalent training data. Here this problem is avoided by only considering the most frequent words and word clusters. Those new lexical items are given entries in the dictionary and the language model dependent on the chosen pronunciation. The consequence is that pronunciation probabilities are incorporated into the language model and removed from the dictionary, resulting in an error rate reduction. Also, the introduction of pronunciation dependent word pairs as lexical items changes the behavior of the language model to approximate higher order n-gram language models, also resulting in improved recognition accuracy.

#25 Improving perplexity measures to incorporate acoustic confusability

Authors: Amit Anil Nanavati ; Nitendra Rajput

Traditionally, Perplexity has been used as a measure of language model performance to predict its goodness in a speech recognition system. However, this measure does not take into account the acoustic confusability between words in the language model. In this paper, we introduce Equivocality, a modification of the perplexity measure that incorporates the acoustic features of words in a language. This gives an improved measuring criterion that matches the recognition results much better than the conventional Perplexity measure. The acoustic distance is used as a feature to represent the acoustic characteristic of the language model. This distance can be computed from the acoustic model parameters alone and does not require any experimentation. We derive the Equivocality measure and calculate it for a set of grammars. Speech recognition experiments further justify the appropriateness of using Equivocality over Perplexity.
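As background for the baseline that this abstract modifies, the sketch below computes conventional perplexity from the probabilities a language model assigns to each word of a test sequence, using the standard definition 2^(-(1/N) * sum of log2 p_i). It does not attempt the Equivocality measure, whose definition is not given in the abstract. The function name and the list-of-probabilities input are illustrative assumptions.

```python
import math


def perplexity(word_probs: list) -> float:
    """Conventional perplexity of a test sequence.

    word_probs -- the model probability of each word given its history,
                  one entry per word in the test sequence (all must be > 0).
    """
    if not word_probs:
        raise ValueError("word_probs must not be empty")
    if any(p <= 0.0 for p in word_probs):
        raise ValueError("all probabilities must be positive")
    # Average negative log2-probability per word (cross-entropy in bits).
    cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** cross_entropy
```

A model that assigns probability 0.25 to every word has a perplexity of 4, matching the intuition of choosing uniformly among four words. Nothing in this number reflects whether those four words sound alike, which is the gap the abstract sets out to address.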